This tutorial demonstrates how to use MOFA for the integration of single-cell multi-omics data.
We consider a dataset where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of mouse development. MOFA provides a method for delineating coordinated variation between the transcriptome and the epigenome.
The data set we use here is a simplified version of the original data set (only E7.5 cells) published in Nature. The full data set can be downloaded from this FTP.
In this vignette we skip all the data processing details as well as the model training part. We focus on the downstream characterisation of a pretrained MOFA model.
Load dependencies. Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages
library(data.table) # fast manipulation of data.frames
library(purrr) # pipes to make the code more readable
library(ggplot2)
library(MOFA2)
Define cell type colors for the visualisations
colors <- c(
"Mesoderm" = "#CD3278",
"Endoderm" = "#43CD80",
"Ectoderm" = "steelblue"
)
As input to the model we quantified DNA methylation and chromatin accessibility values over two different sets of regulatory elements: gene promoters and enhancer elements. RNA expression was quantified over protein-coding genes. After data processing, separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout.
Note that for this tutorial I selected the MOFA Factors that explained at least 1% of variation in the RNA expression.
# load(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/scnmt_gastrulation/gastrulation_scnmt_mofa.RData"))
MOFAobject <- readRDS("/Users/ricard/data/teaching_heidelberg/mofa/MOFAmodel.rds")
MOFAobject
## Trained MOFA with the following characteristics:
## Number of views: 5
## Views names: Enhancers accessibility Enhancers methylation Promoters accessibility Promoters methylation RNA
## Number of features (per view): 1500 1500 1500 1500 2500
## Number of groups: 1
## Groups names: single_group
## Number of samples (per group): 801
## Number of factors: 6
Explore the cell metadata:
- sample: cell ID
- stage: developmental stage.
- lineage: cell type annotation (derived from mapping the cells to the 10x reference atlas).
- pass_rnaQC: did the cell pass QC for RNA expression?.
- pass_metQC: did the cell pass QC for DNA methylation? NA if the cell was only profiled for RNA.
- pass_accQC: did the cell pass QC for chromatin accessibility? NA if the cell was only profiled for RNA.
- group: ignore this column
colnames(samples_metadata(MOFAobject))
## [1] "sample" "stage" "stage_lineage" "lineage"
## [5] "pass_rnaQC" "pass_metQC" "pass_accQC" "group"
table(samples_metadata(MOFAobject)$lineage)
##
## Ectoderm Endoderm Mesoderm
## 189 131 481
Notice that there a lot of cells that only have RNA expression measurements. One of the major advantages of MOFA is that it handles missing values, so we don’t have to remove these cells prior to model training
mean(!is.na(samples_metadata(MOFAobject)$pass_metQC))
## [1] 0.4069913
The function plot_data_overview can be used to obtain an overview of the input data.
It shows how many views (rows) and how many cells (columns) exist, what are their corresponding dimensionalities are. It also shows which views each cell is missing.
view.colors <- c(
"RNA" = "#3CB54E",
"Enhancers accessibility" = "#00BFC4",
"Promoters accessibility" = "#00BFC4",
"Enhancers methylation" = "#F37A71",
"Promoters methylation" = "#F37A71"
)
view.colors = view.colors[views_names(MOFAobject)]
plot_data_overview(MOFAobject, colors = view.colors)
Notice that RNA expression has already been normalised (using scran), but the distribution looks zero-inflated. This is a statistical challenge for some single-cell RNA-seq assays. To model this data we probably want to use a zero-inflated Gaussian distribution, but this not implemented in the MOFA framework. Instead, we used a Gaussian distribution, which should provide a decent approximation but it will likely underestimate the mean expression values.
rna <- get_data(MOFAobject, views="RNA")[[1]][[1]]
hist(rna)
As seen in the previous session, we use M-values instead of Beta-values.
However, when looking at the distribution of M-values we can notice that the epigenetic modalities are not well modelled by a Gaussian likelihood. Ideally this data modality should be modelled with a binomial distribution: for every cell \(i\) and region \(j\) the total number of CpGs correspond to the total number of trials and the number of methylated CpGs to the number of successes.
Unfortunately, the binomial likelihood is not implemented in the MOFA framework and thus we need to calculate M-values that can be approximated with a Gaussian distribution.
met <- get_data(MOFAobject, views="Promoters methylation")[[1]][[1]]
hist(met)
acc <- get_data(MOFAobject, views="Promoters accessibility")[[1]][[1]]
hist(acc)
A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit.
plot_factor_cor(MOFAobject)
The most important insight that MOFA generates is the variance decomposition analysis using plot_variance_explained. This plot shows the percentage of variance explained by each factor in each data modality. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.
plot_variance_explained(MOFAobject) +
theme(
axis.text.x = element_text(angle=25, hjust=1, vjust=1.05)
)
What insights from the data can we learn just from inspecting this plot?
(Q) What do you notice about promoters?
There are a few systematic strategies to characterise the molecular signal that underlies each MOFA factor:
correlate_factors_with_covariatesplot_factor (one factor at a time) and plot_factors (combinations of factors)plot_weights (all weights), plot_top_weights (only the top weights)run_enrichment, followed by plot_enrichment.Plotting Factor 1 values and colouring cells by lineage assignment cleartly shows that this factor captures the variation that is associated with the separation between Mesoderm (positive Factor values) and non-Mesoderm cells (negative Factor values).
plot_factor(MOFAobject,
factor = 1,
color_by = "lineage",
add_violin = TRUE,
dodge = TRUE
) + scale_fill_manual(values=colors)
How do we interpret the factor values?
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.
The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature is more active in the cells with positive factor values, and viceversa.
Let’s plot the distribution of weights for Factor 1.
plot_weights(MOFAobject,
view = "RNA",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
If you are not interested in the full distribution, but just on the top weights, you can instead do:
plot_top_weights(MOFAobject,
view = "RNA",
factor = 1,
nfeatures = 10,
scale = T,
abs = T
)
We expect that genes with large positive weights For Factor 1 to be highlighy expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes with positive weight:
genes <- c("Phlda2","Mesp1")
for (i in genes) {
plot_factor(MOFAobject,
factor = 1,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
Similarly, we expect that genes with large negative weights For Factor 1 to be lowly expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes with negative weight:
genes <- c("Cldn6","Pim2")
for (i in genes) {
plot_factor(MOFAobject,
factor = 1,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
The weights are useful to identify which genes are driving each factors. After inspecting the weights it is good practice to go back to the high-dimensional space and check if the variability that MOFA captures is real.
For example, one could generate a heatmap plot of the RNA expression for the top genes, where samples are sorted by the corresponding factor values. This is the aim of the plot_data_heatmap function:
plot_data_heatmap(MOFAobject,
view = "RNA",
factor = 1,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`,
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE
)
An interesting option of plot_data_heatmap is to plot “denoised” observations. This is obtained by reconstructing the data using the matrix factorisation equation from MOFA:
\[\hat{\mathbf{Y}}^m = \mathbf{W}^m\mathbf{Z}\]
where \(\mathbf{W}^m\) is the weight matrix for the \(m\)-th view, and \(\mathbf{Z}\) is the (shared) factor matrix.
This data reconstruction step essentially removes all the variation that is not captured by the model:
plot_data_heatmap(MOFAobject,
view = "RNA",
factor = 1,
denoise = TRUE,
features = 25,
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F
)
As we have done with RNA, we can also visualise the distribution of weights for the epigenetic modalities. The problem about this is that the large majority of enhancers are not well annotated and we only have the genomic coordinates for them…
plot_weights(MOFAobject,
view = c("Enhancers methylation"),
factor = 1,
nfeatures = 5,
scale = F
)
As done with the RNA above, let’s visualise in the high-dimensional space the DNA methylation variation that MOFA captures using the plot_data_heatmap function. Notice how noisy and sparse DNA methylation data is.
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 1,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
We will use MOFA to impute the missing values. This is based on the data reconstruction equation shown above.
MOFAobject <- impute(MOFAobject)
Plot heatmap with impute=TRUE argument. Despite the gaussian likelihood model not being optimal for DNA methylation data, this heatmap looks much better!
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 1,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
As we guessed from the variance decomposition analysis, the promoters do not display interesting signal during germ layer commitment
plot_data_heatmap(MOFAobject,
view = "Promoters methylation",
factor = 1,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
(Q) Your task is to provide a characterisation for Factor 2.
Try a similar pipeline as for Factor 1 and answer the following questions:
impute)This scatter plot captures in two dimensions all the variation that is required to separate the three germ layers! It corresponds to Figure 2b in the paper.
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "lineage",
dot_size = 2,
legend = F
) + scale_fill_manual(values=colors)
We can colour this plot by the molecular measurements, for example gene expression:
# Ectoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Pim2",
dot_size = 2
)
# Mesoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Mesp1",
dot_size = 2
)
# Endoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Foxa2",
dot_size = 2
)
(Q) Extract (1) the RNA expression data from the mofa model using the get_data function, and (2) the MOFA RNA expression predictions using the predict function. Then, plot a histogram of the two matrices. What do you notice?
Visualisation of Factor values
plot_factor(MOFAobject,
factor = 2,
color_by = "lineage",
add_violin = TRUE,
dodge = TRUE
) + scale_fill_manual(values=colors)
Visualisation of RNA weights
plot_weights(MOFAobject,
view = "RNA",
factor = 2,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
Plot the expression of the genes with largest weight
genes <- c("Foxa2","Krt8")
for (i in genes) {
plot_factor(MOFAobject,
factor = 2,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
Visualisation of DNA methylation patterns in the high-dimensional space
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 2,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
Do predictions
rna.true <- get_data(MOFAobject, views="RNA")[[1]][[1]]
rna.pred <- predict(MOFAobject, views="RNA")[[1]][[1]]
Plot histograms. Notice that the Gaussian likelihood model implemented in MOFA (red) underestimates the mean of zero-inflated data (blue).
hist(rna.true, col=rgb(0,0,1,0.5), main="")
hist(rna.pred, col=rgb(1,0,0,0.5), add=T)
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOFA2_1.0 ggplot2_3.3.0 purrr_0.3.3 data.table_1.12.8
## [5] BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.8.2 Rcpp_1.0.4 lattice_0.20-41
## [4] tidyr_1.0.2 assertthat_0.2.1 digest_0.6.25
## [7] R6_2.4.1 plyr_1.8.6 stats4_3.6.3
## [10] evaluate_0.14 pillar_1.4.3 rlang_0.4.5
## [13] S4Vectors_0.24.3 Matrix_1.2-18 reticulate_1.15
## [16] rmarkdown_2.1 splines_3.6.3 labeling_0.3
## [19] BiocParallel_1.20.1 stringr_1.4.0 pheatmap_1.0.12
## [22] munsell_0.5.0 DelayedArray_0.12.2 HDF5Array_1.14.3
## [25] compiler_3.6.3 xfun_0.12 pkgconfig_2.0.3
## [28] BiocGenerics_0.32.0 mgcv_1.8-31 htmltools_0.4.0
## [31] tidyselect_1.0.0 tibble_3.0.0 bookdown_0.18
## [34] IRanges_2.20.2 matrixStats_0.56.0 fansi_0.4.1
## [37] crayon_1.3.4 dplyr_0.8.5 withr_2.1.2
## [40] grid_3.6.3 nlme_3.1-145 jsonlite_1.6.1
## [43] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
## [46] scales_1.1.0 cli_2.0.2 stringi_1.4.6
## [49] farver_2.0.3 reshape2_1.4.3 ellipsis_0.3.0
## [52] vctrs_0.2.4 cowplot_1.0.0 Rhdf5lib_1.8.0
## [55] RColorBrewer_1.1-2 tools_3.6.3 forcats_0.5.0
## [58] glue_1.4.0 parallel_3.6.3 yaml_2.2.1
## [61] colorspace_1.4-1 rhdf5_2.30.1 BiocManager_1.30.10
## [64] corrplot_0.84 knitr_1.28